Abstract

In this article, artificial neural networks (ANN) are used for modeling the
number of requests received by the 1998 FIFA World Cup website. Modeling
is done by means of time-series forecasting. The log traces of the website,
available through the Internet Traffic Archive (ITA), are processed to obtain
two time-series data sets that are used for finding the following
measurements: requests/day and requests/second. These are modeled by training
and simulating ANN. The method followed to collect and process the data and
to perform the experiments is detailed in this article. In total,
13 cases have been tried and their results are presented, discussed,
compared and summarized. Lastly, future work is also mentioned.

Keywords: web, workload, forecasting, artificial neural networks, trace 
logs, MATLAB 


1. Introduction 

Forecasting the arrival rate of requests to a website helps the web
developer and provider prepare ahead and accordingly meet desired performance
objectives. If the workload intensity, which could be expressed as the
incoming request rate or as the number of concurrent users in the system
[1, p. 19] [2, p. 58], can be predicted, then the necessary computing
resources could be made available, thereby preventing device saturation that
causes long end-to-end response times.

For the purpose of predicting future workloads, in this article artificial
neural networks (ANN) are used to model the workload of the 1998 FIFA
World Cup website by means of time-series forecasting. Actual log traces of
the World Cup site (1.35 billion requests) [3] are available from the
Internet Traffic Archive (ITA) [4]. Other noteworthy traces available through
the ITA website include logs of the NASA website (about 3.46 million
requests) [5] and the EPA web server (47,748 requests) [6]. Furthermore, ITA
provides tools to help read and process the logs. Because the source code is
available, custom enhancements to these tools are possible.

The main reasons for choosing neural networks for the modeling are
their ability to approximate non-linear behavior and their being “data-driven”
[7, 8]. It is best to assume that for real examples (such as the website data
in our case), the inputs and the outputs have a non-linear relationship,
instead of assuming a linear relationship as considered by traditional
approaches [7]. Furthermore, with the data available, it is reasonable to
feed it directly to the ANN and obtain the results of the function
approximation rather than judge beforehand the nature of the functional
relationships. This is because ANN can implement “nonlinear modeling” [7,
p. 36] without prior knowledge of the input and output relationships.

Similar work on web traffic modeling and prediction using ANN has previously
been done by Prevost et al. [9] and Chabaa et al. [10]. Prevost et al. [9]
obtain the trace logs of the NASA website [5] and the EPA web server [6], and
use ANN and regressive linear prediction to predict the next few seconds of
request rate, where step-ahead intervals range between 1 second and 90
seconds. For measuring performance, the mean-squared error (MSE) and root
mean-squared error (RMSE) were calculated and the results of the RMSE were
shown. Chabaa et al. [10] model 1000 data points using different training
algorithms (including the Levenberg-Marquardt (LM) algorithm [11]) on a
multi-layer perceptron (MLP) neural network and compare their performances.
In contrast, this article focuses only on the LM algorithm for training
purposes. A notable difference between the previous papers and this article
is that the 1998 FIFA World Cup website logs depict a busier site with far
more requests: 1.35 billion requests, averaging about 15.3 million
requests/day over a period of 88 days, compared to a total of about 3.46
million requests for the NASA website and 47,748 requests for the EPA server.
Furthermore, in this article, both the requests/second and requests/day data
sets are modeled using ANN. Here the focus is on one-step ahead prediction,
although for one case two-step ahead prediction is also performed.

Saripalli et al. [12] have also presented work relating to workload
prediction; however, their approach relies on a two-step process, the first
step of which tracks the workload and the second of which performs the
prediction. Although our approach does not use workload tracking, ANN could
be used for prediction in the second step of the aforementioned paper.

Giang et al. [13] compare five neural network based models on their
performance in forecasting the hourly HTTP workload of a commercial website.
The performance is measured through the Mean Absolute Percentage Error
(MAPE). Based on the study, it is seen that non-linear autoregressive with
exogenous input (NARX) neural networks are able to predict the workload best.
One difference between their approach and ours is that they use the previous
6 values of the workload as the input to perform the prediction, where each
input value is separated by 24 hours, i.e. past values at time t, t-24, t-48,
etc. are used to predict the value at time t+24.

In this article, the MATLAB Neural Network Toolbox [14, 15] has been used
for creating neural network models and simulating them. Two data sets,
day-requests and epoch-requests¹, are input for training the neural
networks. The first data set is used for predicting requests/day and the
second for predicting requests/second. In total, 13 cases have been tried
and their mean-squared errors (MSE) are presented in this article. The first
11 cases correspond to the day-requests data and the remaining two correspond
to the epoch-requests data. Through this article the applicability of ANN in
modeling and forecasting website workload intensity has been demonstrated;
this is the main contribution of this work.

The structure of the article is as follows. In section 2 the FIFA World Cup 
website data and data collection process are discussed. Section 3 provides a 
brief introduction to ANN and MATLAB. Section 4 describes the methods 
followed. Section 5 presents the results. Section 6 concludes the article. 

2. 1998 World Cup Website Data 

The 1998 World Cup commenced on June 10, 1998 and finished on July
12, 1998 and was played by 32 teams, totalling 64 matches [17, 18]. It was
hosted by France [18], which also won the World Cup that year, beating
the defending champions Brazil 3-0 in the final [19].


¹ Epochs are integers and in this article refer to seconds passed since the
epoch time, Jan 1, 1970 [16]. Therefore, the epoch-requests data set is used
for finding requests/second.

The http://www.france98.com² website for the 1998 World Cup was
hosted from May 6, 1997 to serve its users with live match scores, team and
player stats, match photographs, interviews and much more [18]. The site’s
trace logs of received requests were collected from April 30, 1998 to July
26, 1998, a period of 88 days comprising a total of 1,352,804,107 requests
[3]. Traces were obtained from 33 web servers that resided in four locations:
one location in Paris, France and the other three spread across the USA [3].
The ITA website [4], which hosts the logs, has 92 days of trace log data.
The first four days are empty (April 26, 1998 is represented as day 1) and
are used as filler to help with identifying weekdays. Each day’s log data is
further divided into files of at most 7 million requests, thereby limiting
the file size to within 50 MB and causing one day of log data to be
associated with multiple files [3]. For example, day 38 is divided into two
files: wc_day38_1.gz (6,999,999 requests) and wc_day38_2.gz (188,042
requests). There are in total 249 binary files, which need further
processing to read their contents. For this purpose, the 1998 FIFA log site
[3] includes the following three tools useful here: read, recreate and
checklog. The read tool counts the number of requests in each file, the
recreate tool displays the log contents after converting them from binary,
and checklog presents the request statistics from the information in the
binary files. Readers interested in an in-depth discussion and analysis of
the 1998 World Cup website workload may refer to the article by Arlitt and
Jin [18].

Figure 1 shows the fluctuating requests/day graph, depicting the requests
received by the website. This graph is derived by plotting the day (x-axis)
and requests (y-axis) columns of the day-requests data set, which was
extracted from the trace logs. As seen, the popularity of the website
increased as the beginning of the World Cup approached and decreased quickly
after the end of the matches. The highest workload was witnessed on day 66
(June 30, 1998), accounting for a total of 73,291,868 requests.
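The day numbering used throughout (day 1 = April 26, 1998) can be converted to calendar dates with a few lines of code. The following is a minimal sketch in Python; the function name `trace_day_to_date` is hypothetical and not part of the ITA tools.

```python
from datetime import date, timedelta

# Day 1 of the trace numbering corresponds to April 26, 1998 (as stated in the text).
DAY_ONE = date(1998, 4, 26)


def trace_day_to_date(day_number: int) -> date:
    """Map a 1-based trace day number to its calendar date."""
    if day_number < 1:
        raise ValueError("trace day numbers start at 1")
    return DAY_ONE + timedelta(days=day_number - 1)


# Day 66 is the peak day mentioned in the text.
peak_day = trace_day_to_date(66)
print(peak_day.isoformat())
```

Under this mapping, day 5 is April 30, 1998 (the first day with log data) and day 92 is July 26, 1998 (the last), which agrees with the 88-day collection period.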

Figure 1: Requests/day graph showing the fluctuating workload received by the
1998 World Cup website before, during and after the World Cup matches were
held. The trace logs were collected from April 30, 1998 although the graph
includes empty values from April 26, 1998 to help identify weekdays. The
World Cup started on June 10, 1998 and ended on July 12, 1998. The highest
workload was witnessed on day 66 (June 30, 1998), accounting for a total of
73,291,868 requests.

Figure 2 and Figure 3 show the requests/second vs. epoch graph for the
first 1000 points of the day 6 and day 66-part 10³ files, respectively. It
is key to realize the distinction between the day 6 and the day 66 files in
general. Day 6 has fewer requests/second, with a maximum rate around 50
requests/second, whereas day 66 has a higher request rate, with a maximum
rate around 3300 requests/second and a higher fluctuation per second, as seen
from the graph. These graphs have been derived by processing the log traces
to obtain the epoch-requests data set and then using the latter for graphing.
Section 2.1 and section 2.2 describe the data collection and data format of
the data sets used in this article.

Figure 2: Requests/second vs. epoch graph for day 6 (first 1000 data points)

Figure 3: Requests/second vs. epoch graph for day 66-part 10 (first 1000 data
points)

² The URL http://www.france98.com does not appear to correspond to the 1998
FIFA World Cup anymore. Interested readers may visit the 1998 FIFA Archive
available at http://www.fifa.com/worldcup/archive/edition=1013/index.html,
which includes photographs and archived information of the 1998 World Cup.

³ As discussed earlier in this section, each day’s data is divided into files
of at most 7 million requests, because of which the day 66 log files are
divided into 11 parts. For this article we use the 10th file for our
purposes.

2.1. Data Collection

To collect data and generate the two required data sets, day-requests and
epoch-requests, the first task was to download the log trace files from the
ITA website. The wget utility, which is available on Linux-based operating
systems, was used for downloading the files; however, any browser may be used
to click and download the files. The downloaded files amounted to about 9 GB
and were placed in a Log Trace folder on the hard drive. Afterwards, the read
program was run against each of the 249 binary files in the Log Trace folder
to count the number of requests in each file. This information was then used
to find the number of requests/day for each day, thereby generating the
day-requests data set.
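The aggregation from per-file counts to requests/day can be sketched as follows. This is an illustrative Python sketch, not the authors' tooling: it assumes the per-file counts reported by the read tool have already been collected into a dictionary, and the function name `requests_per_day` is hypothetical.

```python
import re
from collections import defaultdict

# File names follow the wc_dayXX_Y.gz pattern described in the text.
FILE_PATTERN = re.compile(r"wc_day(\d+)_(\d+)\.gz$")


def requests_per_day(file_counts):
    """Sum per-file request counts into one total per trace day."""
    totals = defaultdict(int)
    for name, count in file_counts.items():
        match = FILE_PATTERN.search(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        day = int(match.group(1))
        totals[day] += count
    return dict(sorted(totals.items()))


# Counts for day 38 as quoted in section 2.
counts = {"wc_day38_1.gz": 6_999_999, "wc_day38_2.gz": 188_042}
day_requests = requests_per_day(counts)
print(day_requests)
```

For the two day 38 files quoted earlier, the total is 6,999,999 + 188,042 = 7,188,041 requests.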

Obtaining the epoch-requests data set required further processing steps;
therefore a driver.sh shell script and two programs, read_test and
duplicates, were custom written to automate the process. The read_test
program was developed by modifying the read program (a single-line source
addition was needed) to process a single log trace file, output the epoch of
each request and generate the intermediary data set files: epoch-frequency.
Each of these data files contained epochs occurring zero or multiple times,
e.g. if an epoch occurred 10 times in a file then 10 requests were received
during that particular epoch (refer Figure 4). The duplicates program was
written to process the epoch-frequency data and generate the epoch-requests
data set.

The program simply noted each epoch integer and the number of times it
occurred, outputting each epoch and the associated frequency, thereby
producing a condensed data set. To begin the whole process, driver.sh was
run, which invoked the read_test and duplicates programs to automatically
retrieve epoch-requests for all the files in the Log Trace folder. Figure 4
and the following steps describe what the driver.sh script performs when
run:

Step 1: Chooses one file from the Log Trace folder. Chosen file:
wc_dayXX_Y.gz

Step 2: Invokes the read_test program on the chosen file to extract
epoch-frequency data. Output: wc_dayXX_Y.gz.log

Step 3: Invokes the duplicates program to generate epoch-requests data.
Output: wc_dayXX_Y.gz.count.txt

Step 4: Repeats steps 1-3 above until no files remain to be processed.

Figure 4: Process Seconds Data
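The condensing step performed by the duplicates program can be illustrated with a short Python sketch. This is not the original program (whose source is not shown here); it only assumes that the input is a sequence of epochs, one per request, as described above. The function name `epoch_requests` is hypothetical.

```python
from collections import Counter


def epoch_requests(epochs):
    """Condense one-epoch-per-request data into (epoch, requests) rows."""
    counts = Counter(int(e) for e in epochs)
    # Sort by epoch so the output forms a time-ordered series.
    return sorted(counts.items())


# Four requests: three in one second, one in the next.
sample = [898207201, 898207201, 898207202, 898207201]
rows = epoch_requests(sample)
for epoch, requests in rows:
    print(epoch, requests)
```

Note that a second with no requests does not appear in the output of this sketch; whether such gaps should be filled with zeros is a design choice that the text does not specify.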
2.2. Data Format

This sub-section briefly describes how data in the day-requests and
epoch-requests data sets are organized. Based on the data collection process
discussed in section 2.1, the day-requests data set contains the day and the
corresponding requests received that day, labeled as the ‘DAY’ and
‘REQUESTS’ columns respectively. Also added manually to the data set is the
number of matches played on each day, labeled ‘MATCHES’. Another column,
‘ISMATCH’, was added to indicate whether a match was played on that day: 1 if
so and 0 otherwise. For example, if two matches were played on a particular
day, then ‘ISMATCH’ would have the value 1 and ‘MATCHES’ would be set to 2
for that day. The data for the number of matches for each day was obtained
from [20]; however, it was later found that the FIFA Archive also includes
the information [17]. The following shows a short sample of how the
day-requests data set is organized:

DAY  REQUESTS  MATCHES  ISMATCH
45   20068724  0        0
46   50395084  2        1
47   52406319  2        1
48   48956621  3        1
49   23528986  3        1
50   21093494  3        1
51   58013849  3        1
52   40732114  2        1
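The relationship between the ‘MATCHES’ and ‘ISMATCH’ columns is simple enough to express in code. The following Python sketch derives ‘ISMATCH’ from ‘MATCHES’; in the article the column was added manually, so this is only an illustration, and the function name `add_ismatch` is hypothetical.

```python
def add_ismatch(rows):
    """Append ISMATCH (1 if MATCHES > 0, else 0) to (day, requests, matches) rows."""
    return [
        (day, requests, matches, 1 if matches > 0 else 0)
        for day, requests, matches in rows
    ]


# Three rows taken from the sample above, without the ISMATCH column.
sample = [(45, 20068724, 0), (46, 50395084, 2), (48, 48956621, 3)]
table = add_ismatch(sample)
for row in table:
    print(*row)
```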


Based on the data collection process discussed in section 2.1, the
epoch-requests data set contains the epoch and the corresponding requests
(i.e., requests/second) received during that epoch, labeled as the ‘EPOCH’
and ‘REQUESTS’ columns respectively. The following shows a sample of how the
epoch-requests data set is organized:


EPOCH REQUESTS 
898207201 145 
898207202 242 
898207203 276 
898207204 283 
898207205 285 
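Because the ‘EPOCH’ column counts seconds since Jan 1, 1970, each row can be converted to a readable timestamp. The Python sketch below interprets the epoch in UTC; this is an assumption, since the time zone used by the trace is not stated in this section. The function name `epoch_to_utc` is hypothetical.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch: int) -> datetime:
    """Convert seconds since Jan 1, 1970 to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


# First epoch of the sample above.
first = epoch_to_utc(898207201)
print(first.isoformat())
```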


3. ANN and Time-series forecasting 

This section begins with a brief introduction to ANN, followed by a
discussion of ANN as a method for time-series forecasting [21]. The tools
provided by MATLAB that aid in ANN time-series forecasting are also
mentioned.

3.1. ANN 


Figure 5: ANN model showing the input, hidden and output layers. This is an example
of feedforward neural networks where the output generated from a layer is fed as input to
the next layer [8].

The concept of artificial neural networks (ANN) is derived from the human
nervous system, where a network of neurons, i.e. a neural network, processes
signals [10, 8, 7]. Each neuron in an ANN receives inputs and processes
them based on mathematical functions and relations, generating an output
that is either fed as input to another neuron or served as the output of the
whole ANN (refer Figure 5). The layer comprising the input signals is known
as the input layer, the last layer of neurons that generates the final
outputs is the output layer, and the layers between the input and output
layers are known as hidden layers. In the case of the multi-layer perceptron
(MLP), a.k.a. feedforward neural network, the output generated from a layer
is fed as input to the next layer [8, 10]. The connections between neurons
have weights, which affect the output of the network. Other factors that
affect the output are the bias value of the neuron and the transfer function
f, which are both explained through the equation below. The output of neuron
i with R inputs is described by the equation [15, p. 1-7]:

$$\text{output} = f\left(\sum_{j=1}^{R} \left(\text{weight}_{ij} \cdot \text{input}_j\right) + \text{bias}\right)$$
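The neuron equation above can be written as a small function. The following is a minimal Python sketch for a single neuron; the names `neuron_output` and `linear` are hypothetical, and the weights and inputs are made-up illustration values rather than values from the experiments.

```python
def neuron_output(weights, inputs, bias, transfer):
    """Apply the transfer function to the weighted sum of inputs plus the bias."""
    if len(weights) != len(inputs):
        raise ValueError("one weight is required per input")
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(net)


def linear(n):
    """Linear transfer function: returns its argument unchanged."""
    return n


# 0.5 * 2.0 + (-1.0) * 1.0 + 0.25 = 0.25
y = neuron_output([0.5, -1.0], [2.0, 1.0], 0.25, linear)
print(y)
```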
There could be various transfer functions such as sigmoid (tansig) and
linear (purelin) [15]. For our purposes, the sigmoid function is used for
hidden layers and the linear function is used for the output layer. The
sigmoid transfer function is described as follows [10]:

$$f(x) = \frac{1}{1 + e^{-x}}$$
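To show how the two transfer functions combine, the sketch below computes the forward pass of a network with one sigmoid hidden layer and a single linear output neuron. It is written in plain Python and follows the logistic formula given in the text rather than any toolbox function (MATLAB's tansig has a tanh-shaped output range, so its values would differ). The function names and the weights are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, following the formula in the text."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: sigmoid hidden layer followed by one linear output neuron."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Linear output layer: weighted sum of the hidden activations plus a bias.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# With zero inputs and zero hidden biases each hidden neuron outputs sigmoid(0) = 0.5.
result = forward([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0], [1.0, 1.0], 0.0)
print(result)
```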


Before the neural network is used, training data comprising rows of inputs
and desired outputs are fed to the neural network for training, which
adjusts the weights of the connections based on a selected training
algorithm. Back propagation [10] is a well-known training algorithm. In this
article, the Levenberg-Marquardt backpropagation [15] training algorithm has
been used. Once the network is trained, test inputs are fed to the network
and the network is simulated to produce the outputs.


3.2. Time-series forecasting and MATLAB 

ANN models are useful for time-series forecasting [21, 15], where the
future outcome of a variable is predicted through the use of current and
previous values of the variable. ANN have been applied to forecasting in
finance and markets, electric power load, sunspots, environmental
temperature, airline passengers, etc. [7]. Interested readers may refer to
[14, pp. 1-6-1-7] and [7, pp. 39-40] for further applications of ANN.

In general, for time-series forecasting the output y(t) is predicted based
on d previous delayed inputs. An example is the nonlinear autoregressive
(NAR) prediction provided by MATLAB, which is based on the following
equation [14]:

$$y(t) = f\left(y(t-1), \ldots, y(t-d)\right)$$
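In practice, the NAR formulation means turning a single series into rows of delayed inputs and a target. The Python sketch below builds such rows for an arbitrary list of delays; it illustrates the idea only and is not how the MATLAB toolbox prepares its data internally. The function name `make_lagged` is hypothetical.

```python
def make_lagged(series, delays):
    """Return (inputs, targets) with inputs[k] = [y(t-d) for d in delays], targets[k] = y(t)."""
    max_delay = max(delays)
    inputs, targets = [], []
    # The first max_delay values have no complete history and are used only as inputs.
    for t in range(max_delay, len(series)):
        inputs.append([series[t - d] for d in delays])
        targets.append(series[t])
    return inputs, targets


# Delays 1:2, i.e. inputs y(t-1) and y(t-2).
X, y = make_lagged([10, 20, 30, 40, 50], delays=[1, 2])
for row, target in zip(X, y):
    print(row, "->", target)
```

With delays 1:2, the first two values of the series serve only as inputs, so a series of 5 points yields 3 input-target rows.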

The mean squared error (MSE), which is a method for determining the
error of prediction, is calculated as follows based on N predictions, the
target values t_i and the predicted values a_i [15, p. 2-16]:

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(t_i - a_i\right)^2$$
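The MSE formula translates directly into code. A minimal Python sketch follows; the function name `mean_squared_error` and the sample values are illustrative only.

```python
def mean_squared_error(targets, outputs):
    """Average of the squared differences between targets and predictions."""
    if len(targets) != len(outputs) or not targets:
        raise ValueError("targets and outputs must be non-empty and of equal length")
    return sum((t - a) ** 2 for t, a in zip(targets, outputs)) / len(targets)


# Squared errors are 0, 0 and 4, so the MSE is 4/3.
error = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(error)
```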

Figure 6: Forecasting can be subdivided into one-step ahead or multi-step
ahead forecasting [7]. Multi-step ahead could be further classified into
direct and iterative forecasting [7].

Figure 6 shows the different types of forecasting. Forecasting can be
subdivided into one-step ahead or multi-step ahead forecasting [7]. One-step
ahead forecasting generates a single output predicting the value of y for the
next time-step only, whereas multi-step ahead forecasting predicts the value
of y for m future time-steps, i.e. y(t), y(t + 1), ..., y(t + m - 1).
Multi-step forecasting could further be classified into direct and iterative
forecasting approaches [7]. In the direct approach, there are multiple output
nodes, whereas in the iterative method, a single output is looped back as
input to iteratively predict future values [7]. In this article, the main
focus is on one-step ahead forecasting; however, a simple multi-step
prediction has also been performed.
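The iterative method can be sketched independently of any particular network: a one-step predictor is called repeatedly, and each prediction is appended to the history so that it becomes a delayed input for the next step. In the Python sketch below, `predict_one_step` stands in for a trained network; the `extrapolate` function is a made-up linear rule used only to make the example runnable.

```python
def iterative_forecast(history, delays, predict_one_step, steps):
    """Predict `steps` future values by feeding each prediction back as an input."""
    window = list(history)
    forecasts = []
    for _ in range(steps):
        # Delayed inputs y(t-d) taken from the end of the (growing) window.
        features = [window[-d] for d in delays]
        next_value = predict_one_step(features)
        forecasts.append(next_value)
        window.append(next_value)
    return forecasts


def extrapolate(features):
    """Hypothetical one-step model: linear extrapolation from y(t-1) and y(t-2)."""
    y1, y2 = features
    return 2 * y1 - y2


forecast = iterative_forecast([1.0, 2.0, 3.0], [1, 2], extrapolate, steps=3)
print(forecast)
```

Because later predictions depend on earlier ones, errors can accumulate over the forecast horizon in this method.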


Figure 7: ntstool. The MATLAB Time Series Tool (ntstool) offers three network
types: Non-linear Autoregressive (NAR) network, Non-linear Autoregressive
with exogenous input (NARX) network and Non-linear Input Output (NIO)
network.

In this article, MATLAB is used as a tool for time-series forecasting. To
help with the forecasting and with ANN in general, MATLAB provides the
Neural Network Toolbox [14, 15]. ANN can easily be created, trained and
simulated through the toolbox. In addition, graphing options to show error,
performance and response are available. To use the toolbox, one can enter
commands at the command line (or run scripts) or use GUI-based aids [14].
The focused time-delay neural network (FTDNN) [15] is a simple time-series
prediction network that can be created through the command line by calling
the command timedelaynet. This is the same as the one-step ahead prediction
described above. For GUI options, the Time Series Tool (ntstool) [14]
graphical interface (Figure 7) lets users choose from different predefined
ANN: Non-linear Autoregressive (NAR), Non-linear Autoregressive with
exogenous input (NARX) and Non-linear Input Output (NIO) networks. NAR and
NARX networks are networks where the output is fed back into the network for
prediction purposes; however, during training the feedback loop could be
left open, as the original inputs are available, and closed later for
simulation [15]. The main distinction between NAR and NARX networks is that
the latter uses not only the variable to be predicted as input but also
another set of delayed inputs, x(t), x(t - 1), ..., x(t - d), for the
forecast of y(t) [15]. Finally, the NIO network only uses the x(t),
x(t - 1), ..., x(t - d) inputs for prediction of y(t). The GUI allows saving
the actions performed (including network creation, training and testing) as
an auto-generated script, which can either be modified or run directly from
the command line.

Networks can be trained in two ways (Figure 8). If batch training is used,
then the weights and biases of the network are updated after all of the
input-output rows have been provided to the network, whereas in the case
of incremental training, each input-output row updates the network [15].
In this article, FTDNN and batch training are mostly used for time-series
forecasting; however, incremental training and the NARX network have also
been tried in different cases, details of which are available in the next
section.
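The difference between the two training modes lies in when the weights are updated. The Python sketch below contrasts them on a single linear neuron with one weight, using plain gradient descent on the squared error. This is a deliberately simplified stand-in: it is not the Levenberg-Marquardt algorithm used in the article, and the data, learning rate and function names are made up for illustration.

```python
def batch_train(xs, ys, w, lr, epochs):
    """Batch mode: one weight update per pass, using the gradient averaged over all rows."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w


def incremental_train(xs, ys, w, lr, epochs):
    """Incremental mode: the weight is updated after every single input-output row."""
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            w -= lr * 2 * (w * x - y) * x
    return w


# Data generated by y = 2x, so both modes should move the weight toward 2.
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
w_batch = batch_train(xs, ys, w=0.0, lr=0.01, epochs=100)
w_incremental = incremental_train(xs, ys, w=0.0, lr=0.01, epochs=100)
print(round(w_batch, 4), round(w_incremental, 4))
```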


Figure 8: Batch and Incremental training


4. Method 

This section describes the method followed to model the 1998 World Cup
website requests using ANN. The aim is to evaluate the performance of
different networks and determine how well the request rates were predicted
by the models. The first step was data collection, which has been described
in detail in Section 2.1. Once the data was available as two data sets,
day-requests and epoch-requests, the next step was to use MATLAB to create
ANN, train them and finally simulate the networks to perform the prediction.
For this purpose, multiple MATLAB scripts were used to automate portions of
the process. A few scripts were initially auto-generated through ntstool and
then modified for the purposes described here, while the other scripts were
custom written. The following four scripts were used:

1. createNetwork.m: create the ANN based on the chosen network structure.

2. trainSimNetwork.m: train and simulate the ANN.

3. initNetwork.m: initialize the network weights. This is useful to begin
with weights that might be trained to reach a better performance.

4. revertNetwork.m: revert to the network weights in place just before
initNetwork.m was called. This is useful if, after calling initNetwork.m
and then training, the network showed poor performance, in which case the
previous network weights were better.

Once the scripts were developed, the following process was followed manually
(Figure 9):

Step 1: Execute the createNetwork.m script.

Step 2: Execute the trainSimNetwork.m script. Go to step 3b (initNetwork.m)
if the performance is better than that of the previous network, else go to
step 3a (revertNetwork.m).

Step 3a: Execute the revertNetwork.m script.

Step 3b: Execute the initNetwork.m script.

Step 4: Repeat steps 2-3 above another four times.
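One way to read the steps above is as a "keep the best of several restarts" loop: train, remember the weights if they beat the best seen so far, re-initialize, and repeat. The Python sketch below expresses that reading only; it is not a translation of the MATLAB scripts, and `train_and_score` and `reinitialize` are hypothetical callables standing in for trainSimNetwork.m and initNetwork.m.

```python
def best_of_restarts(train_and_score, initial_weights, reinitialize, rounds=5):
    """Train from several initializations and keep the weights with the lowest MSE."""
    best_weights, best_mse = None, float("inf")
    weights = initial_weights
    for _ in range(rounds):
        trained, mse = train_and_score(weights)
        if mse < best_mse:
            # Better than anything seen so far: remember these weights.
            best_weights, best_mse = trained, mse
        # Otherwise the stored best plays the role of the reverted network.
        weights = reinitialize()
    return best_weights, best_mse


# Toy stand-ins: "training" leaves the weight unchanged and scores it against 3.
candidates = iter([1, 5, 3, 4, 2])
best, score = best_of_restarts(
    train_and_score=lambda w: (w, (w - 3) ** 2),
    initial_weights=0,
    reinitialize=lambda: next(candidates),
)
print(best, score)
```

The default of five rounds corresponds to the first training plus the four repetitions in step 4.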

Figure 9: Custom MATLAB Scripts: Process Flow Diagram

To evaluate different networks, 13 cases were tried and their results
analyzed. The process described above using the MATLAB scripts was followed
for each case, with minor script modifications as required. Table 1 lists
the 13 cases. D1-D11 use the day-requests data set and S12-S13 use the
epoch-requests data set. D1 describes the base network and subsequent cases
include modifications to this network’s structure, data distribution
(training/validation/test), training mode and/or input delays. D1 uses an
FTDNN to perform one-step ahead prediction and has two input delays. The
network is trained using batch mode and the contiguous distribution for
training/validation/testing is 70%/15%/15%. All cases except D9-D11 use
open-loop models, whereas D9-D11 use closed-loop networks for simulation,
although their training is performed using an open-loop network. D9 uses a
NAR network and D10 and D11 use a NARX network. The x(t) inputs for D10 and
D11 are the ‘MATCHES’ and ‘ISMATCH’ columns of the data set, respectively.
The epoch-requests data set is modeled by S12, which is sub-divided into
cases S12a and S12b. S12a models the first 1000 rows of the
wc_day6_1.gz.count.txt file, and S12b, using the same network, simulates the
whole wc_day6_1.gz.count.txt file. The last one, case S13, models the first
1000 points of the wc_day66_10.gz.count.txt data set.
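A contiguous 70%/15%/15% distribution means that the series is cut into three consecutive blocks instead of being sampled at random. The Python sketch below computes the index ranges for such a split. The rounding rule is an assumption of this sketch and may differ from the one MATLAB applies; the function name `contiguous_split` is hypothetical.

```python
def contiguous_split(n_samples, train_ratio=0.70, val_ratio=0.15):
    """Split indices 0..n_samples-1 into consecutive train/validation/test ranges."""
    n_train = round(n_samples * train_ratio)
    n_val = round(n_samples * val_ratio)
    train = range(0, n_train)
    validation = range(n_train, n_train + n_val)
    # Whatever remains after training and validation becomes the test block.
    test = range(n_train + n_val, n_samples)
    return train, validation, test


train, validation, test = contiguous_split(100)
print(len(train), len(validation), len(test))
```

Keeping the blocks contiguous preserves the time ordering, so the test block always lies after the data the network was trained on.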

Table 1: A list of cases that have been tested using
MATLAB Neural Network Toolbox.

Case    Description
D1      i) FTDNN
        ii) Batch training mode
        iii) day-requests data (92 points)
        iv) training/validation/testing distribution: 70%/15%/15%
            (contiguous data)
        v) Delays = 1:2, i.e. inputs = y(t-1) and y(t-2)
        vi) one-step ahead prediction
        vii) hiddenLayerSize = 10
        viii) one hidden layer
D2      Same as D1 except: hiddenLayerSize = 30
D3      Same as D1 except: two hidden layers
D4      Same as D1 except: training/validation/testing distribution:
        80%/10%/10%
D5      Same as D1 except: training/validation/testing distribution:
        60%/20%/20%
D6      Same as D1 except: hiddenLayerSize = 1
D7      Same as D1 except: Incremental (adapt) training mode
D8      Same as D1 except: Delays = 1:7
D9⁴     Same as D1 except: NAR network. Delays = 2:3, i.e. inputs = y(t-2)
        and y(t-3). 2-step ahead prediction (closed-loop).
D10⁴    Same as D1 except: NARX network. Exogenous input is ‘MATCHES’.
        Uses open-loop for training and closed-loop for simulation.
D11⁴    Same as D1 except: NARX network. Exogenous input is ‘ISMATCH’.
        Uses open-loop for training and closed-loop for simulation.
S12a    Same as D1 except: data is seconds data, the first 1000 points of
        the wc_day6_1.gz.count.txt epoch-requests data file.
S12b⁵   Same as S12a except: simulation only. The network from S12a is used
        for simulation. The data contains all points of the
        wc_day6_1.gz.count.txt epoch-requests file.
S13     Same as S12a except: data is seconds data, the first 1000 points of
        the wc_day66_10.gz.count.txt epoch-requests file.

⁴ Closed-loop.

⁵ Simulation only. Uses network trained from S12a.
After trying each case, the results, which include the mean-squared error
(MSE) and the correlation coefficient R of the modeled data, were recorded.
These results are analyzed in the following section.

5. Results 

In this section, the results from studying the 13 cases described earlier
in section 4 are presented. The main results are the complete performance
in MSE of the network (for all of training, validation and testing) and
the correlation coefficient R (Table 2). Four figures (Figure 10, Figure 11,
Figure 12, Figure 13), which show the requests vs. day and requests vs.
seconds graphs for cases D1, S12a, S12b and S13 respectively, are also
discussed.
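The correlation coefficient R can be computed directly from the targets and the network outputs. The Python sketch below assumes that R denotes the Pearson correlation coefficient between targets and outputs; the text does not spell out the formula, so this is an assumption of the sketch, and the function name and sample values are illustrative.

```python
import math


def correlation_coefficient(targets, outputs):
    """Pearson correlation between targets and outputs (assumed definition of R)."""
    n = len(targets)
    if n != len(outputs) or n < 2:
        raise ValueError("need two equally long sequences with at least two values")
    mean_t = sum(targets) / n
    mean_a = sum(outputs) / n
    covariance = sum((t - mean_t) * (a - mean_a) for t, a in zip(targets, outputs))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_a = sum((a - mean_a) ** 2 for a in outputs)
    return covariance / math.sqrt(var_t * var_a)


# Outputs that are an exact linear function of the targets give R = 1.
r = correlation_coefficient([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(r)
```

A value of R close to 1 indicates that the outputs track the targets closely in shape, which is a different property from a low MSE: outputs can correlate well with the targets and still be offset or scaled.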


Figure 10: Case D1 - Requests-per-day vs. Day. Day 1 on the graph corresponds
to the first prediction; since the delays are 1:2, Day 1 here corresponds to
Day 3 of the actual data set (the Day 1 and Day 2 values of the actual data
set are the delayed inputs).

In Figure 10 the x-axis represents the predicted time steps in days;
therefore, day 1 here corresponds to the first predicted day based on 2
delayed inputs, as case D1 uses two delayed inputs. To clarify further, for
D1, day 1 and day 2 of the “actual” data set are the delayed inputs and the
first predicted day is day 3, i.e. day 1 in the graph corresponds to day 3
of the data set. The dashed blue line indicates the actual requests/day
values (i.e. targets) and the solid green line is the response obtained by
simulating the neural network. The first contiguous 70% of the data, shown
as purple cross points, are the training data, the next 15%, shown as blue
cross points, are the validation data and the remainder, shown in red, are
the testing data. From the graph it is seen that the ANN has been able to
model the requests/day reasonably well. The MSE is 0.0127 and R is 0.90134,
the latter showing that there is a strong relation between the targets and
the response.


Figure 11: Case S12a - Requests-per-second vs. Seconds of the first 1000
points of day 6. Second 1 on this graph corresponds to the first prediction;
since the delays are 1:2, Second 1 here corresponds to Second 3 of the actual
data set, where the Second 1 and Second 2 values of the actual data set are
the delayed inputs.

In Figure 11 the x-axis displays the predicted time steps in seconds;
therefore, second 1 here corresponds to the first predicted second based on
2 delayed inputs, following the same explanation as for the graph in Figure
10. In the graph, one-step ahead prediction has been used. The target
requests/second are shown as a dashed blue line and the solid red line
represents the outputs. From the graph, the ANN appears to model the
requests/second to a reasonably good degree of accuracy, with a network
performance (MSE) of 0.0163. The correlation coefficient is R = 0.5043,
depicting a positive but not strong correlation between the targets and the
response.

Similarly, Figure 12 shows the simulation obtained by using the same network
as in case S12a. The MSE for case S12b is 0.0151, which is comparatively
better than that of case S12a (0.0163).

Figure 12: Case S12b - Requests-per-second vs. Seconds of complete day 6.
Second 1 on this graph corresponds to the first prediction; since the delays
are 1:2, Second 1 here corresponds to Second 3 of the actual data set, where
the Second 1 and Second 2 values of the actual data set are the delayed
inputs.

Figure 13 shows the graph for the first 1000 seconds of day 66-part 10. The
MSE for case S13 is 0.0125 and the correlation coefficient is R = 0.65686,
overall depicting a reasonable prediction of the requests/second for day
66-part 10.

Table 2: Results of the 13 cases

Case    Description                                Complete Perf. (MSE)   R
D1      Base network using day-requests            0.0127                 0.90134
D2      hiddenLayerSize = 30                       0.0295                 0.84082
D3      two hidden layers                          0.0087                 0.92925
D4      80%/10%/10% (data)                         0.0078                 0.93894
D5      60%/20%/20% (data)                         0.0252                 0.84021
D6      hiddenLayerSize = 1                        0.0188                 0.8457
D7      Incremental training                       0.0768                 0.57631
D8      Delays = 1:7                               0.0431                 0.80433
D9⁶     NAR network. 2-step ahead prediction       0.0682                 0.50645
        (closed-loop)
D10⁶    NARX network. Exogenous input:             0.0779                 0.63225
        ‘MATCHES’ (closed-loop)
D11⁶    NARX network. Exogenous input:             0.0498                 0.54486
        ‘ISMATCH’ (closed-loop)
S12a    epoch-seconds data set                     0.0163                 0.5043
        (1000 points of day 6)
S12b⁷   Network from S12a.                         0.0151                 0.61852
        Complete day 6 data.
S13     epoch-seconds data set                     0.0125                 0.65686
        (1000 points of day 66_10)

Table 2 summarizes the results of the network evaluations, which are 
discussed below. Amongst the day-requests data set cases, the best 
performers are D3, D4 and D1, with MSE between 0.0078 and 0.0127 and 
correlation coefficient between 0.90134 and 0.93894. In particular, the case 
D4 network has the best performance (lowest MSE and highest R). D4 uses 80% 
of the data for training, a possible reason for its better results. Using 
two hidden layers (case D3) results in better performance than using one 
hidden layer (base case D1). 
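The grouping of the day-requests cases can be reproduced by sorting the Table 2 values by MSE. The sketch below does this in Python; the MSE and R values are copied from Table 2, while the variable names are chosen here for illustration only.

```python
# (case, MSE, R) for the day-requests cases, copied from Table 2.
day_request_results = [
    ("D1", 0.0127, 0.90134),
    ("D2", 0.0295, 0.84082),
    ("D3", 0.0087, 0.92925),
    ("D4", 0.0078, 0.93894),
    ("D5", 0.0252, 0.84021),
    ("D6", 0.0188, 0.8457),
    ("D7", 0.0768, 0.57631),
    ("D8", 0.0431, 0.80433),
    ("D9", 0.0682, 0.50645),
    ("D10", 0.0779, 0.63225),
    ("D11", 0.0498, 0.54486),
]

# A lower MSE means better performance, so sort in ascending order of MSE.
ranked = sorted(day_request_results, key=lambda row: row[1])

for rank, (case, mse, r) in enumerate(ranked, start=1):
    print(f"{rank:2d}. {case:<4} MSE={mse:.4f}  R={r:.5f}")
```

Sorting by MSE places D4, D3 and D1 at the top of the list, in that order.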

⁶ Closed-loop. 
⁷ Simulation only. Uses network trained from S12a. 

Figure 13: Case S13 - Requests-per-second vs. Seconds for the first 1000 points of
day 66-part10. Second 1 on this graph corresponds to the first prediction; because
the input delays are 1:2, it corresponds to Second 3 of the actual data set, with
Second 1 and Second 2 of the actual data set serving as the delayed inputs. 

Cases D2, D5, D6 and D8 show average performance, with MSE between 0.0188 
and 0.0431 and correlation coefficient between 0.80433 and 0.8457. Having 30 
neurons or 1 neuron in the hidden layer (cases D2 and D6, respectively) 
instead of 10 neurons (base case D1) does not increase the performance; on 
the contrary, it decreases it. The results also show that using more data 
for training provides better performance: case D4, which uses 80% of the 
data for training, shows better results than D1, which uses 70%, which in 
turn performs better than D5, which uses 60%. With two delays (base case D1) 
the prediction shows better performance than with seven delay inputs (case 
D8). 
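The three data divisions compared above can be sketched as follows. This is an illustration in Python under stated assumptions: the series is divided into contiguous blocks (the division method actually used is not restated here), the 70% case is assumed to use a 70%/15%/15% division, and the function name `split_series` and the 100-point series are hypothetical.

```python
def split_series(series, train_pct, val_pct):
    """Split a series into contiguous training, validation and test blocks.

    Percentages are integers; whatever remains after the training and
    validation blocks is used as the test block.
    """
    n = len(series)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    return train, val, test


# Hypothetical series of 100 data points, for illustration only.
series = list(range(100))

# Training/validation percentages; the 15% validation share for D1 is an assumption.
divisions = {"D1": (70, 15), "D4": (80, 10), "D5": (60, 20)}

for case, (train_pct, val_pct) in divisions.items():
    train, val, test = split_series(series, train_pct, val_pct)
    print(f"{case}: train={len(train)}, validation={len(val)}, test={len(test)}")
```

A larger training share leaves fewer points for validation and testing, so the comparison between D4, D1 and D5 also changes the data on which each network is evaluated.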

Cases D7, D9, D10 and D11 are the worst performers, with MSE between 0.0498 
and 0.0779 and correlation coefficient between 0.50645 and 0.63225. The 
results show that the closed-loop NARX networks (cases D10 and D11), which 
have the exogenous input ‘MATCHES’ or ‘ISMATCH’, show poor performance and 
low correlation between targets and response. Furthermore, iterative 
two-step ahead prediction (case D9) does not perform well in comparison to 
one-step ahead prediction (case D1). Likewise, incremental training (case 
D7) does not fare well in comparison to batch training. 
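The difference between one-step ahead and iterative (closed-loop) multi-step ahead prediction can be sketched as follows. The code is an illustration in Python; `toy_predictor` is a hypothetical stand-in for a trained network (it simply averages its delayed inputs), and the sample series is invented.

```python
def one_step_ahead(predict, series, n_delays=2):
    """Open-loop: every prediction uses the true previous values as inputs."""
    return [predict(series[t - n_delays:t]) for t in range(n_delays, len(series))]


def closed_loop(predict, history, steps):
    """Closed-loop: each prediction is fed back as an input for the next step."""
    window = list(history)
    outputs = []
    for _ in range(steps):
        y = predict(window)
        outputs.append(y)
        # Drop the oldest input and append the prediction itself.
        window = window[1:] + [y]
    return outputs


def toy_predictor(window):
    """Hypothetical stand-in for a trained network: mean of the delayed inputs."""
    return sum(window) / len(window)


series = [10.0, 20.0, 30.0, 40.0]

print("one-step ahead:", one_step_ahead(toy_predictor, series))
print("closed-loop:   ", closed_loop(toy_predictor, series[:2], 2))
```

In the closed-loop case the second prediction depends on the first prediction rather than on a measured value, so any error in the first step is carried into the second. This is one plausible reason why iterative multi-step prediction is harder than one-step ahead prediction.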

Amongst the epoch-requests data set cases, S12a and S12b model the first 
1000 data points and the complete data points of day 6 requests, 
respectively. The S12a and S12b cases have an MSE of 0.0163 and 0.0151, 
respectively. The network trained for S12a thus shows comparatively better 
performance when used to predict the data of S12b than when predicting the 
data of S12a. Day 66-part10 data (1000 points) is modeled by the network in 
case S13, which has an MSE of 0.0125 and a correlation of 0.65686. 
Summarizing the results above, a network is able to make better predictions 
when a larger percentage of the data is used for training. Two hidden layers 
show better results than one hidden layer for the data used. Using too few 
or too many neurons decreases performance; in particular, for the data 
modeled, 10 neurons show comparatively better performance. Batch training 
also performs better than incremental training. Furthermore, using exogenous 
inputs in the day-requests data set did not help improve the performance. 
Finally, one-step ahead prediction shows better results than iterative 
multi-step ahead prediction. Based on the results, when the data 
distribution for training, validation and simulation is not varied and batch 
training is employed, a network with two hidden layers and a hidden layer 
size of 10 shows the best performance. 
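The distinction between batch and incremental training can be illustrated with a minimal sketch. It uses a single linear weight trained by gradient descent in Python, which is a deliberate simplification and not the network or training algorithm used in the experiments; the function names, learning rate and sample data are hypothetical. The sketch only shows when the weight is updated in each mode and does not reproduce the performance difference reported in Table 2.

```python
def train_batch(samples, epochs, lr):
    """Batch mode: one weight update per epoch, using all samples at once."""
    w = 0.0
    for _ in range(epochs):
        # Gradient of the mean squared error over the whole data set.
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w


def train_incremental(samples, epochs, lr):
    """Incremental mode: the weight is updated after every single sample."""
    w = 0.0
    for _ in range(epochs):
        for x, y in samples:
            w -= lr * 2 * (w * x - y) * x
    return w


# Hypothetical input/target pairs following y = 2x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

print("batch:      ", train_batch(samples, epochs=50, lr=0.05))
print("incremental:", train_incremental(samples, epochs=50, lr=0.05))
```

On this noise-free toy problem both modes approach the same weight. On real, noisy traces the per-sample updates of incremental training follow each individual data point, which may contribute to the weaker result observed for case D7.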

6. Conclusions 

In this article, the workload intensity of the FIFA World Cup website has 
been modeled using ANN. Artificial neural networks have been employed for 
time-series prediction of two data sets: day-requests and epoch-requests 
(day 6 and day 66-part10). In total, 13 cases have been studied and 
compared. One base network is used, and subsequent cases include 
modifications to this network’s structure, data distribution 
(training/validation/test), training mode and/or input delays. The methods 
followed to collect and process the data and to perform the experiments have 
been detailed in this article. The networks were created, trained and 
simulated using MATLAB Neural Network Toolbox. The results of all the cases 
have been presented, discussed, compared and summarized. Based on the 
results, when the data distribution for training, validation and simulation 
is not varied and batch training is employed, a network with two hidden 
layers and a hidden layer size of 10 shows the best performance. This 
network has been shown to model the request intensity with reasonable 
accuracy, as seen from the MSE and correlation coefficient. 

As future work, the relationship between the website workload intensity and 
the audience of the matches could be investigated. On the same note, the 
expected audience could be predicted with artificial neural networks, using 
website workload intensity as an input. The popularity and rankings of teams 
could also serve as inputs for forecasting the website request rate and the 
expected audience in upcoming matches. A study of how the network structure 
could be modified, or of other means such as adding inputs, to help improve 
incremental and multi-step ahead prediction would be fruitful. ANN models 
could also be compared with other linear and non-linear prediction 
approaches and the results analyzed. If website and audience data are 
collected over the years, then a more accurate prediction appears possible. 
Such assumptions could be tested and verified. 

Through this article the applicability of ANN in modeling and forecasting 
website workload intensity has been demonstrated. This applicability is not 
restricted to the FIFA World Cup website: ANN modeling could also be used 
for websites of other sporting events (e.g. Super Bowl or Stanley Cup), and 
for any website in general, thereby again establishing the role of ANN in 
forecasting and expanding the horizons of their use. 